11 research outputs found

    A large scale evaluation of TBProfiler and Mykrobe for antibiotic resistance prediction in Mycobacterium tuberculosis

    Get PDF
Recent years have seen growing interest in predicting antibiotic resistance from whole-genome sequencing data, with promising results obtained for Staphylococcus aureus and Mycobacterium tuberculosis. In this work, we gathered 6,574 sequencing read datasets of public M. tuberculosis genomes with associated antibiotic resistance profiles for both first- and second-line antibiotics. We performed a systematic evaluation of TBProfiler and Mykrobe, two widely used software tools for predicting resistance in M. tuberculosis. The size of the dataset allowed us to obtain confident estimates of their overall predictive performance, to assess precisely the individual predictive power of the markers they rely on, and to study how these tools behave across the major M. tuberculosis lineages. While this study confirmed the overall good performance of these tools, it revealed that an important fraction of the catalog of mutations they embed is of limited predictive power. It also revealed that these tools offer different sensitivity/specificity trade-offs, mainly due to the different sets of mutations they embed but also to their underlying genotyping pipelines. More importantly, it showed that their predictive performance varies greatly across lineages for some antibiotics, suggesting that the predictions made by these tools should be deemed more or less confident depending on the lineage inferred and the predictive performance of the marker(s) actually detected. Finally, we evaluated the relevance of machine learning approaches operating on the set of markers detected by these tools and show that they present an attractive alternative strategy, reaching better performance for several drugs while significantly reducing the number of candidate mutations to consider.
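As a minimal illustration of the kind of lineage-stratified evaluation described above, the sketch below computes sensitivity and specificity per inferred lineage; the column names and toy data are assumptions for illustration, not the study's actual pipeline or results.

```python
# Toy sketch of a per-lineage evaluation of a resistance predictor. The
# column names ("lineage", "phenotype", "prediction") and the data are
# hypothetical placeholders, not the study's actual schema or results.
import pandas as pd

def sensitivity_specificity(df):
    tp = ((df.prediction == 1) & (df.phenotype == 1)).sum()
    fn = ((df.prediction == 0) & (df.phenotype == 1)).sum()
    tn = ((df.prediction == 0) & (df.phenotype == 0)).sum()
    fp = ((df.prediction == 1) & (df.phenotype == 0)).sum()
    return tp / (tp + fn), tn / (tn + fp)  # guard empty classes in real data

# One row per isolate: inferred lineage, laboratory phenotype, tool prediction.
df = pd.DataFrame({
    "lineage":    ["L1", "L1", "L2", "L2", "L4", "L4"],
    "phenotype":  [1, 0, 1, 0, 0, 1],
    "prediction": [1, 0, 1, 1, 0, 1],
})
for lineage, group in df.groupby("lineage"):
    sens, spec = sensitivity_specificity(group)
    print(lineage, f"sensitivity={sens:.2f}", f"specificity={spec:.2f}")
```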

    Controlling Microgrids Without External Data: A Benchmark of Stochastic Programming Methods

    Full text link
Microgrids are local energy systems that integrate energy production, demand, and storage units. They are generally connected to the regional grid to import electricity when local production and storage do not meet the demand. In this context, Energy Management Systems (EMS) are used to ensure the balance between supply and demand while minimizing the electricity bill or an environmental criterion. The main implementation challenges for an EMS come from the uncertainties in consumption, in local renewable energy production, and in the price and carbon intensity of electricity. Model Predictive Control (MPC) is widely used to implement an EMS but is particularly sensitive to forecast quality, and often requires a subscription to expensive third-party forecast services. We introduce four Multistage Stochastic Control Algorithms relying only on historical data obtained from on-site measurements. We formulate them under the shared framework of Multistage Stochastic Programming and benchmark them against two baselines in 61 different microgrid setups using the EMSx dataset. Our most effective algorithm produces notable cost reductions compared to an MPC that uses the same uncertainty model to generate predictions, and performs comparably to an ideal MPC that relies on perfect forecasts.
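A minimal two-stage sketch in the spirit of scenario-based stochastic programming from on-site historical data only is shown below; the scenario model, prices, stored-energy value, and grid search are illustrative assumptions, not the paper's algorithms.

```python
# Pick a first-stage battery discharge by minimizing the expected cost over
# historical net-load scenarios. All numbers here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
# Historical net-load scenarios for the next step (kWh): demand minus solar.
scenarios = rng.normal(loc=3.0, scale=1.5, size=200)

price_import = 0.25   # EUR/kWh bought from the regional grid (assumed)
value_stored = 0.15   # assumed value of keeping 1 kWh for later stages
soc = 5.0             # battery state of charge (kWh)

def expected_cost(discharge):
    """Average cost over scenarios if we discharge `discharge` kWh now."""
    residual = np.maximum(scenarios - discharge, 0.0)  # unmet load from grid
    return price_import * residual.mean() + value_stored * discharge

# First-stage decision: grid-search the discharge minimizing expected cost,
# within the energy actually available in the battery.
candidates = np.linspace(0.0, soc, 51)
best = min(candidates, key=expected_cost)
print(f"discharge {best:.2f} kWh, expected cost {expected_cost(best):.3f} EUR")
```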

    Predicting bacterial resistance from whole-genome sequences using k-mers and stability selection

    No full text
Background: Several studies have demonstrated the feasibility of predicting bacterial antibiotic resistance phenotypes from whole-genome sequences, the prediction process usually amounting to detecting the presence of genes involved in antibiotic resistance mechanisms, or of specific mutations within these genes, previously identified from a training panel of strains. We address the problem from the supervised statistical learning perspective, without relying on prior information about such resistance factors. We rely on a k-mer-based genotyping scheme and a logistic regression model, thereby combining several k-mers into a probabilistic model. To identify a small yet predictive set of k-mers, we rely on the stability selection approach (Meinshausen et al., J R Stat Soc Ser B 72:417–473, 2010), which consists in penalizing logistic regression models with a Lasso penalty, coupled with extensive resampling procedures. Results: Using public datasets, we applied the resulting classifiers to two bacterial species and achieved predictive performance equivalent to the state of the art. The models are extremely sparse, involving 1 to 8 k-mers per antibiotic, and are therefore remarkably easy and fast to evaluate on new genomes (from raw reads to assemblies). Conclusion: Our proof of concept demonstrates that stability selection is a powerful approach to investigate bacterial genotype-phenotype relationships.
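A minimal sketch of the stability selection scheme described above, assuming a binary k-mer presence/absence matrix; the synthetic data, penalty strength, and selection threshold are illustrative choices, not the paper's settings.

```python
# Stability selection: fit an L1-penalized logistic regression on many random
# subsamples and keep the k-mers selected in a large fraction of the fits.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_strains, n_kmers = 200, 500
X = rng.integers(0, 2, size=(n_strains, n_kmers)).astype(float)
# Make the phenotype depend on two "causal" k-mers (columns 0 and 1).
y = ((X[:, 0] + X[:, 1] + rng.normal(0, 0.3, n_strains)) > 1).astype(int)

n_resamples, counts = 100, np.zeros(n_kmers)
for _ in range(n_resamples):
    idx = rng.choice(n_strains, size=n_strains // 2, replace=False)
    clf = LogisticRegression(penalty="l1", C=0.3, solver="liblinear")
    clf.fit(X[idx], y[idx])
    counts += (clf.coef_[0] != 0)

stable = np.where(counts / n_resamples >= 0.8)[0]  # selection frequency >= 0.8
print("stable k-mers:", stable)
```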

    Adapting the promotion time model to study the time to HIV detection in children and the occurrence of other infectious diseases

    No full text
To study the delay between transmission and detection of an infectious disease, particularly in situations of multiple exposures, we developed a survival model adapted to interval-censored data. This model explicitly represents the delay between the moment of transmission and the detection of the infection. Initially developed in oncology, it accounts for a proportion of patients who will never undergo the studied event, i.e., patients who were not contaminated. In a first example on mother-to-child transmission of HIV-1 in South Africa, we studied the influence of infant feeding practices (formula, exclusive breastfeeding, or mixed feeding) on the probability of HIV-1 transmission at birth and during breastfeeding. We showed that the probability of transmission at birth was not significantly modified by early breastfeeding practice, and that as long as breastfeeding remained exclusive, the risk of transmission was negligible compared with that observed with mixed feeding. In a second example on catheter-associated nosocomial urinary tract infections in intensive care units, we showed that the proportion of infections attributable to catheter placement was tenfold smaller than the proportion attributable to the catheter's continued presence.
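For reference, the population survival function of the promotion time cure model takes the standard form below (generic cure-model notation from the literature, not necessarily the thesis's own):

```latex
% Promotion time cure model: theta > 0 is the expected number of latent
% causes (e.g., exposures) and F(t) the c.d.f. of the time for one cause
% to produce the event; e^{-theta} is the cured (never-infected) fraction.
S_{\mathrm{pop}}(t) = \exp\{-\theta\, F(t)\},
\qquad
\lim_{t \to \infty} S_{\mathrm{pop}}(t) = e^{-\theta}
```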

    Large-scale Machine Learning for Metagenomics Sequence Classification

    No full text
Metagenomics characterizes the taxonomic diversity of microbial communities by sequencing DNA directly from an environmental sample. One of the main challenges in metagenomics data analysis is the binning step, where each sequenced read is assigned to a taxonomic clade. Due to the large volume of metagenomics datasets, binning methods need fast and accurate algorithms that can operate with reasonable computing requirements. While standard alignment-based methods provide state-of-the-art performance, compositional approaches that assign a taxonomic class to a DNA read based on the k-mers it contains have the potential to provide faster solutions. In this work, we investigate the potential of modern, large-scale machine learning implementations for the taxonomic assignment of next-generation sequencing reads based on their k-mer profiles. We show that machine learning-based compositional approaches benefit from increasing the number of fragments sampled from the reference genomes to tune their parameters, up to a coverage of about 10, and from increasing the k-mer size to about 12. Tuning these models involves training a machine learning model on about 10^8 samples in 10^7 dimensions, which is out of reach of standard software but can be done efficiently with modern implementations for large-scale machine learning. The resulting models are competitive in terms of accuracy with well-established alignment tools for problems involving a small to moderate number of candidate species and reasonable amounts of sequencing errors. We show, however, that compositional approaches are still limited in their ability to deal with problems involving a greater number of species, and are more sensitive to sequencing errors. We finally confirm that compositional approaches achieve faster prediction times, with a gain of 3 to 15 times over the BWA-MEM short-read mapper, depending on the number of candidate species and the level of sequencing noise.
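A minimal sketch of a compositional classifier of the kind discussed above, using hashed k-mer features and a linear model trained with SGD; the synthetic reads, feature dimension, and model choice are assumptions for illustration, not the paper's implementation.

```python
# Classify reads from their k-mer profiles with a hashing trick + linear model.
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

def kmers(read, k=12):
    """Turn a read into a space-separated list of its k-mers."""
    return " ".join(read[i:i + k] for i in range(len(read) - k + 1))

rng = np.random.default_rng(0)
def fake_read(bias, n=100):
    # Two "species" with different base composition: enough for a toy model.
    p = [0.4, 0.1, 0.1, 0.4] if bias else [0.1, 0.4, 0.4, 0.1]
    return "".join(rng.choice(list("ACGT"), size=n, p=p))

reads = [fake_read(i % 2) for i in range(400)]
labels = [i % 2 for i in range(400)]

# 2**20 hashed k-mer features stand in for the ~10^7-dimensional models above.
vec = HashingVectorizer(analyzer="word", n_features=2**20, alternate_sign=False)
X = vec.transform(kmers(r) for r in reads)
clf = SGDClassifier(loss="log_loss").fit(X, labels)
print("training accuracy:", clf.score(X, labels))
```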

    A fast and agnostic method for bacterial genome-wide association studies: Bridging the gap between k-mers and genetic events.

    Get PDF
Genome-wide association study (GWAS) methods applied to bacterial genomes have shown promising results for genetic marker discovery or detailed assessment of marker effect. Recently, alignment-free methods based on k-mer composition have proven their ability to explore the accessory genome. However, they lead to redundant descriptions and results which are sometimes hard to interpret. Here we introduce DBGWAS, an extended k-mer-based GWAS method producing interpretable genetic variants associated with distinct phenotypes. Relying on compacted De Bruijn graphs (cDBG), our method gathers cDBG nodes, identified by the association model, into subgraphs defined from their neighbourhood in the initial cDBG. DBGWAS is alignment-free and only requires a set of contigs and phenotypes. In particular, it does not require prior annotation or reference genomes. It produces subgraphs representing phenotype-associated genetic variants such as local polymorphisms and mobile genetic elements (MGE). It offers a graphical framework which helps interpret GWAS results. Importantly, it is also computationally efficient: experiments took an hour and a half on average. We validated our method using antibiotic resistance phenotypes for three bacterial species. DBGWAS recovered known resistance determinants, such as mutations in core genes in Mycobacterium tuberculosis and genes acquired by horizontal transfer in Staphylococcus aureus and Pseudomonas aeruginosa, along with their MGE context. It also enabled us to formulate new hypotheses involving genetic variants not yet described in the antibiotic resistance literature. An open-source tool implementing DBGWAS is available at https://gitlab.com/leoisl/dbgwas.
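A minimal sketch of the association step of a k-mer-based GWAS, testing each unitig's presence/absence pattern against the phenotype; DBGWAS itself uses a more elaborate association model that accounts for population structure, so the plain Fisher test and synthetic data below are only an illustration.

```python
# Per-unitig association test: 2x2 contingency table of presence vs phenotype.
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(0)
n_strains, n_unitigs = 100, 50
presence = rng.integers(0, 2, size=(n_strains, n_unitigs))
# Phenotype driven by unitig 0, so it should come out on top.
phenotype = (presence[:, 0] + rng.normal(0, 0.3, n_strains) > 0.5).astype(int)

pvals = []
for j in range(n_unitigs):
    table = [[np.sum((presence[:, j] == a) & (phenotype == b))
              for b in (0, 1)] for a in (0, 1)]
    pvals.append(fisher_exact(table)[1])

print("most associated unitig:", int(np.argmin(pvals)), "p =", min(pvals))
```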

    A strategy to build and validate a prognostic biomarker model based on RT-qPCR gene expression and clinical covariates

    Get PDF
Background: Construction and validation of a prognostic model for survival data in the clinical domain is still an active field of research. Nevertheless, there is no consensus on how to develop routine prognostic tests based on a combination of RT-qPCR biomarkers and clinical or demographic variables. In particular, estimating the model performance requires properly accounting for the RT-qPCR experimental design. Results: We present a strategy to build, select, and validate a prognostic model for survival data based on a combination of RT-qPCR biomarkers and clinical or demographic data, and we provide an illustration on a real clinical dataset. First, we compare two cross-validation schemes: a classical outcome-stratified cross-validation scheme and an alternative one that accounts for the RT-qPCR plate design, especially when samples are processed in batches. The latter is intended to limit the performance discrepancies, also called the validation surprise, between the training and the test sets. Second, strategies for model building (covariate selection, functional relationship modeling, and statistical model) as well as the estimation of performance indicators are presented. Since in practice several prognostic models can exhibit similar performance, complementary criteria for model selection are discussed: the stability of the selected variables, the model optimism, and the impact of the omitted variables on the model performance. Conclusion: On the training dataset, appropriate resampling methods are expected to prevent upward biases due to unaccounted technical and biological variability arising from the experimental and intrinsic design of the RT-qPCR assay. Moreover, the stability of the selected variables, the model optimism, and the impact of the omitted variables on the model performance are pivotal indicators for selecting the optimal model to be validated on the test dataset.
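A minimal sketch contrasting the two cross-validation schemes discussed above, using scikit-learn's stratified and group-aware splitters; the data, plate layout, and binarized outcome are synthetic assumptions for illustration.

```python
# Outcome-stratified folds versus folds that keep each RT-qPCR plate intact.
import numpy as np
from sklearn.model_selection import StratifiedKFold, GroupKFold

rng = np.random.default_rng(0)
n_samples, n_plates = 96, 8
X = rng.normal(size=(n_samples, 5))           # expression + clinical covariates
y = rng.integers(0, 2, size=n_samples)        # binarized outcome (illustrative)
plate = np.repeat(np.arange(n_plates), n_samples // n_plates)

# Classical scheme: stratify on the outcome, ignore the plate design.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
for train, test in skf.split(X, y):
    pass  # fit and evaluate the prognostic model here

# Plate-aware scheme: whole plates go to either training or test, so batch
# effects cannot leak across the split (reducing the "validation surprise").
for train, test in GroupKFold(n_splits=4).split(X, y, groups=plate):
    assert len(set(plate[train]) & set(plate[test])) == 0
```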
